Creating a Custom Pipeline
-
On the DataGOL Home page, from the left navigation panel, click Lakehouse > Pipelines.
-
On the Pipelines page, in the upper-right corner, click the + New Pipeline button.
-
Select Custom for tailored data extraction with user-defined queries.
-
Provide a name for the Output table.
-
Define exactly the data you need by writing custom queries. You can view your data sources, add queries, and format, run, and test those queries.
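As a rough illustration of the kind of query this step expects, the sketch below runs an aggregation query against a small in-memory SQLite table. The `orders` table, its columns, and the sample rows are hypothetical and exist only for this example. SQLite is used here purely so the snippet is self-contained. The Spark and Athena SQL dialects differ from SQLite in places, so treat this as a sketch of the query's shape rather than something guaranteed to run unchanged in the pipeline editor.

```python
import sqlite3

# Hypothetical source table and columns, for illustration only.
CUSTOM_QUERY = """
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY customer_id
"""


def run_custom_query(rows):
    """Load sample rows into an in-memory table and run the custom query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(CUSTOM_QUERY).fetchall()
    conn.close()
    return result


# Made-up sample data: two completed orders for customer 1,
# one cancelled order for customer 2, one completed order for customer 3.
sample_rows = [
    (1, 20.0, "completed"),
    (1, 5.0, "completed"),
    (2, 12.5, "cancelled"),
    (3, 7.5, "completed"),
]

result = run_custom_query(sample_rows)
for row in result:
    print(row)
```

Checking a query on a handful of known rows like this makes it easier to confirm that the filter and grouping behave as intended before running and testing the query in the pipeline itself.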
-
From the Destination drop-down, select a destination for the pipeline to specify where the processed data will be delivered. You can also select the + New Warehouse button to create a new destination for the pipeline.
-
From the Settings drop-down, customize the pipeline parameters specifically for your needs.
-
The Format field is populated with the format that is selected for the source. You cannot change the format after the pipeline is created.
-
Choose the query engine that executes the data transformations and processing within your pipeline: either Spark or Athena. You cannot change the query engine after the pipeline is created.
-
The chosen replication method determines how often the pipeline runs and how and when the system executes it to replicate data. The following methods are available:
-
Manual (On-Demand): With this method, the pipeline only runs when a user explicitly initiates it. It requires manual intervention each time data replication is needed. If no manual trigger occurs, the pipeline remains inactive.
-
Cron: This method allows for highly specific scheduling using cron expressions. You can define precise times, days of the week, and even specific minutes or seconds for the pipeline to run automatically. For example, a cron expression could be set to run the pipeline every Sunday and Thursday at a particular time.
-
Scheduled: This method offers predefined intervals for automatic pipeline execution, such as every hour, every 3 hours, daily, weekly, monthly, or yearly. Once a schedule is set and submitted, the pipeline runs automatically at the specified frequency, regardless of the system's status at that exact moment.
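To make the Cron method more concrete, here is a minimal sketch that previews when an expression such as "every Sunday and Thursday at 06:30" would fire. It assumes the standard five-field Unix cron syntax (minute, hour, day of month, month, day of week, with 0 meaning Sunday). The mention of seconds above suggests the product may accept an extended format with an additional seconds field, so check the syntax the Cron field expects before reusing an expression. The helper names `expand_field`, `cron_matches`, and `next_runs` are invented for this example and are not part of DataGOL.

```python
from datetime import datetime, timedelta


def expand_field(field, low, high):
    """Expand one cron field ('*', '0,4', '1-5', '*/15') into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def cron_matches(expression, moment):
    """Return True if `moment` satisfies a five-field cron expression.

    Simplification: all five fields must match. Real cron implementations
    treat day-of-month and day-of-week as alternatives when both are
    restricted; that case is not handled here.
    """
    minute, hour, day, month, weekday = expression.split()
    # Python's weekday() uses Monday=0; cron uses Sunday=0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        moment.minute in expand_field(minute, 0, 59)
        and moment.hour in expand_field(hour, 0, 23)
        and moment.day in expand_field(day, 1, 31)
        and moment.month in expand_field(month, 1, 12)
        and cron_weekday in expand_field(weekday, 0, 6)
    )


def next_runs(expression, start, count):
    """Step forward minute by minute and collect the next `count` matches."""
    runs = []
    moment = start.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while len(runs) < count:
        if cron_matches(expression, moment):
            runs.append(moment)
        moment += timedelta(minutes=1)
    return runs


# 06:30 every Sunday (0) and Thursday (4).
SUNDAY_THURSDAY = "30 6 * * 0,4"

for run in next_runs(SUNDAY_THURSDAY, datetime(2024, 1, 1), 3):
    print(run.strftime("%A %Y-%m-%d %H:%M"))
```

Previewing the next few run times this way is a quick check that an expression means what you intend before you submit the pipeline.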
-
Optionally, select the Would you like to run this pipeline immediately? checkbox to run the pipeline as soon as it is created.
-
Click Submit.